Mining Time Series Motifs with Probabilistic Guarantees
Matteo Ceccarello
University of Padova
matteo.ceccarello@unipd.it
Joint work with Johann Gamper (U. Bolzano)
Outline
Problem definition
State of the art
Locality Sensitive Hashing
Our algorithm
Experimental results
Time series and subsequences
Fix a subsequence length \(w\)
The Euclidean distance measures the similarity between subsequences
Subsequences can be seen as points in a \(w\)-dimensional space
This time series is from an experiment recording the behavior of a Circulifer tenellus insect. In particular, the subsequences below are associated with a particular behavior while feeding.
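The view of subsequences as points in \(\mathbb{R}^w\) can be sketched in a few lines of illustrative Python, using numpy's `sliding_window_view` for the extraction:

```python
import numpy as np

# Illustrative: extract all length-w subsequences of a time series
# as rows of a matrix, i.e. as points in R^w.
def subsequences(ts, w):
    return np.lib.stride_tricks.sliding_window_view(ts, w)

ts = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 1.0, 2.0, 1.0])
S = subsequences(ts, 4)          # 5 points in R^4
d = np.linalg.norm(S[0] - S[4])  # Euclidean distance between two of them
```

Motif discovery usually z-normalizes each subsequence before measuring distances; the sketch skips that step for brevity.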
Top-\(k\) motifs
Definition 1 Given a time series \(T\) and a length \(w\), the top-\(k\) motifs are the \(k\) pairs of subsequences of \(T\) of length \(w\) with smallest distance, such that no two subsequences in any pair overlap with each other.
State of the art: Matrix Profile
Algorithm
For each subsequence, find the nearest neighbor
Find the subsequence with the closest nearest neighbor
Pros
Leverages the fact that consecutive subsequences share a lot of structure
Parallelizes across many GPUs
Finds motifs in a time series of 1 billion points in one day, using 5 GPUs
Cons
It is an \(\Omega(n^2)\) algorithm for a time series of length \(n\)
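The quadratic cost is easy to see in a brute-force sketch of this nearest-neighbor strategy (illustrative Python only, not the actual GPU implementation; it returns the top-1 motif by comparing all non-overlapping pairs):

```python
import numpy as np

# Illustrative brute-force version of the nearest-neighbor strategy:
# compare every pair of non-overlapping subsequences, which is the
# Theta(n^2) work a Matrix-Profile-style approach must pay.
def top_motif_bruteforce(ts, w):
    S = np.lib.stride_tricks.sliding_window_view(ts, w)
    best, pair = np.inf, None
    for i in range(len(S)):
        for j in range(i + w, len(S)):  # j >= i + w skips overlapping (trivial) matches
            d = np.linalg.norm(S[i] - S[j])
            if d < best:
                best, pair = d, (i, j)
    return pair, best
```

The real Matrix Profile avoids recomputing each distance from scratch by exploiting the overlap between consecutive subsequences, but the number of pairs considered stays quadratic.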
Goal
Find the top motifs without computing all-pairs-distances
Challenges
Subsequences can be seen as vectors
These vectors are high dimensional
Curse of dimensionality: indices such as R-trees degrade to linear scans
Locality Sensitive Hashing
Subsequences of length \(w\) are points in \(R^w\), with Euclidean distance.
The key point is that we only compute distances between subsequences falling into the same bucket.
To lower the collision probability we concatenate \(\tau\) hash functions \[
\hat{h}(\vec{x}) = \langle h_1(\vec{x}), \dots, h_\tau(\vec{x}) \rangle
\] which improves the precision of the index.
To increase the recall of the index, we repeat the construction \(L\) times with independent hash functions.
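A minimal sketch of such an index for Euclidean distance, using the classic p-stable (random projection) hash family \(h(\vec{x}) = \lfloor (\vec{a}\cdot\vec{x} + b)/r \rfloor\); the quantization width `r` and the function name `lsh_index` are assumptions for illustration:

```python
import numpy as np

# Build L hash tables; in each, every point gets a key that is the
# concatenation of tau p-stable hashes h(x) = floor((a . x + b) / r).
def lsh_index(points, tau, L, r=4.0, seed=0):
    rng = np.random.default_rng(seed)
    n, w = points.shape
    tables = []
    for _ in range(L):
        A = rng.standard_normal((tau, w))   # Gaussian projection vectors
        b = rng.uniform(0.0, r, size=tau)   # random offsets
        keys = np.floor((points @ A.T + b) / r).astype(int)
        buckets = {}
        for i, key in enumerate(map(tuple, keys)):
            buckets.setdefault(key, []).append(i)
        tables.append(buckets)
    return tables
```

Candidate pairs are the subsequences sharing a bucket in at least one of the \(L\) tables; only their distances are ever computed.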
Computing the success probability
Consider a distance \(d\)
Assume we have performed \(L\) LSH repetitions, each concatenating \(\tau\) hash functions
We can then compute the probability of having seen, at least once, every pair at distance \(d\)
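For the p-stable hash family, the single-hash collision probability \(p(d)\) has a closed form, and the probability that a pair at distance \(d\) collides at least once across the repetitions is \(1-(1-p(d)^\tau)^L\). A sketch, where the width `r = 4` and the function names are assumptions:

```python
import math

# Single-hash collision probability p(d) for the p-stable (random
# projection) LSH family with quantization width r (standard E2LSH formula).
def p_collision(d, r=4.0):
    if d == 0:
        return 1.0
    c = r / d
    Phi = 0.5 * (1.0 + math.erf(-c / math.sqrt(2.0)))  # P(Z < -r/d), Z std. normal
    return (1.0 - 2.0 * Phi
            - (2.0 / (math.sqrt(2.0 * math.pi) * c)) * (1.0 - math.exp(-c * c / 2.0)))

# Probability that a pair at distance d shares a bucket in at least
# one of L repetitions, each concatenating tau hashes.
def success_probability(d, tau, L, r=4.0):
    p = p_collision(d, r)
    return 1.0 - (1.0 - p ** tau) ** L
```

Since \(p(d)\) decreases with \(d\), close pairs are found with much higher probability than distant ones, which is exactly the property the index exploits.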
A simple algorithm
Auto-tuning the parameters
How do we set \(\tau\)?
A wrong value might force us to do too many repetitions
We adopt an approach that auto-tunes the parameters based on the data at hand
\(L_{max}\) maximum number of repetitions,
\(K_{max}\) maximum number of concatenations (e.g. 4),
\(\delta\) probability parameter: succeed with probability \(\ge 1-\delta\).
Repetition 1
Repetition 2
In each iteration we compute the distances between all subsequences falling in the same bucket.
In the first iteration, with \(\tau=4\), we discover a pair at distance 2.
After 10 repetitions, we find a pair at distance 1.
After 100 repetitions we have not found a better pair, and the success probability is only about 0.85.
We then consider shorter prefixes of the hashes, going through the 100 repetitions again.
After 15 repetitions, we observe that the success probability is above our target, and thus return.
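The stopping rule driving this example can be sketched as follows; `can_stop` is a hypothetical helper name, and `p_d` denotes the single-hash collision probability at the distance of the best pair found so far:

```python
# Stop once a pair at the best distance found so far would have been
# seen at least once, across L_done repetitions with tau concatenations,
# with probability >= 1 - delta.
def can_stop(p_d, tau, L_done, delta):
    return 1.0 - (1.0 - p_d ** tau) ** L_done >= 1.0 - delta
```

When \(L_{max}\) repetitions are exhausted without reaching the target, the algorithm moves to shorter hash prefixes (smaller \(\tau\)), which increases \(p_d^\tau\) and hence the success probability contributed by each repetition.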
Guarantees
Theorem 1 Our algorithm finds the exact top-\(k\) motifs each with probability \(\ge 1-\delta\).
For \(\delta=0.1\), each true motif is found with 90% probability
This means that we can control the recall of the algorithm
Tradeoff: the higher the desired recall, the slower the discovery process
Complexity
Theorem 2 The LSH index construction takes time \[
O(K_{max} \cdot \sqrt{L_{max}} n\log n)
\]
Theorem 3 Let \(d(m_k)\) be the distance of the \(k\)-th motif, and let \(i'\le K_{max}\), \(j' \le L_{max}\) be the parameters used by the optimal LSH algorithm. Then, the algorithm considers \[
O\left(
j'\sum_{m\in T^w\times T^w} p(d(m))^{i'}
+
(L_{max}-j')\sum_{m\in T^w\times T^w} p(d(m))^{i'+1}
\right)
\] pairs in expectation, where \(T^w\) is the set of length-\(w\) subsequences of \(T\) and \(p(d)\) is the collision probability of a single hash function at distance \(d\).
Optimizations
Use a trie data structure to re-use computations across iterations at different prefix lengths
Compute dot products for hash values in the frequency domain (as also done in some implementations of the Matrix Profile)
For shorter time series, or if the relative contrast is very small, use the Matrix Profile.
For time series of a few million values and above, with a not-too-small relative contrast, try Attimo.
References
Matteo Ceccarello, Johann Gamper: Fast and Scalable Mining of Time Series Motifs with Probabilistic Guarantees. Proc. VLDB Endow. 15(13): 3841-3853 (2022)
Running time for the top-10 motifs, for different numbers of repetitions.
Difficult datasets
Data from LIGO:
Length 100k, window 1k
Top motif distance \(\approx 40\)
Average distance \(\approx 44\)
Relative contrast \(\approx 1.1\)
Attimo: 6 hours
SCAMP: \(\approx 1\) second
The freezer dataset and the 7th motif
7M points, window 5000
In black are the distances of the top-10 motifs extracted from the matrix profile.
In red is the distance of a pair of subsequences neither of which is the nearest neighbor of the other, and which does not overlap with higher-ranked motifs.
The matrix profile holds the distances and indices of the 1-nearest neighbor of each subsequence, but top-k motifs would require the k-nearest neighbors to be maintained in the matrix profile.
Top-k and k-NN
Solid lines are nearest neighbor distances
The dashed line is the distance of the top-2 pair in the definition we are using
The Matrix Profile contains information about 1-nearest neighbors (solid black lines)